Search CORE

27 research outputs found

Compressed multiple pattern matching

Author: Kosolobov D.
Sivukhin N.
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/01/2019

Peer reviewed

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Helsingin yliopiston digitaalinen arkisto

Lempel-Ziv Parsing for Sequences of Blocks

Author: Kosolobov D.
Valenzuela D.
Publication venue: 'MDPI AG'
Publication date: 01/01/2021

The Lempel-Ziv parsing (LZ77) is a widely popular construction lying at the heart of many compression algorithms. These algorithms usually treat the data as a sequence of bytes, i.e., blocks of fixed length 8. Another common option is to view the data as a sequence of bits. We investigate the following natural question: what is the relationship between the LZ77 parsings of the same data interpreted as a sequence of fixed-length blocks and as a sequence of bits (or other “elementary” letters)? In this paper, we prove that, for any integer b > 1, the number z of phrases in the LZ77 parsing of a string of length n and the number z_b of phrases in the LZ77 parsing of the same string in which blocks of length b are interpreted as separate letters (e.g., b = 8 in case of bytes) are related as z_b = O(bz log(n/z)). The bound holds for both “overlapping” and “non-overlapping” versions of LZ77. Further, we establish a tight bound z_b = O(bz) for the special case when each phrase in the LZ77 parsing of the string has a “phrase-aligned” earlier occurrence (an occurrence equal to the concatenation of consecutive phrases). The latter is an important particular case of parsing produced, for instance, by grammar-based compression methods. © 2021 by the authors. Licensee MDPI, Basel, Switzerland. Funding: This research was funded by the Ministry of Science and Higher Education of the Russian Federation (Ural Mathematical Center project No. 075-02-2021-1387).
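
To make the two phrase counts in this abstract concrete, the following is a minimal sketch that counts the phrases of a greedy LZ77 parsing of the same data read once letter by letter and once as blocks of length b. It is a naive quadratic-time illustration, not the construction analysed in the paper; the function names `lz77_phrase_count` and `to_blocks` are hypothetical, and the "overlapping" version of LZ77 is assumed.

```python
def lz77_phrase_count(seq):
    """Count phrases in a greedy LZ77 parsing (overlapping version).

    Each phrase is the longest prefix of the remaining input that also
    starts at some earlier position, or a single letter if there is none.
    """
    n = len(seq)
    i = 0
    phrases = 0
    while i < n:
        longest = 0
        for j in range(i):              # try every earlier starting position
            length = 0
            while i + length < n and seq[j + length] == seq[i + length]:
                length += 1
            longest = max(longest, length)
        i += max(longest, 1)            # a fresh letter forms its own phrase
        phrases += 1
    return phrases


def to_blocks(seq, b):
    """Split seq into consecutive blocks of length b (the last may be shorter)."""
    return [seq[k:k + b] for k in range(0, len(seq), b)]


bits = "0110" * 6
z = lz77_phrase_count(bits)                   # parsing over single letters
z_b = lz77_phrase_count(to_blocks(bits, 4))   # parsing over blocks of length 4
print(z, z_b)
```

Because the blocks are compared as whole strings, the same function serves for both alphabets; only the way the input is cut changes.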

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Detecting One-variable Patterns

Author: A Amir
A Ehrenfeucht
D Angluin
D Kosolobov
E Czeizler
F Manea
G Manacher
J Kärkkäinen
JEF Friedl
M Crochemore
M Lothaire
M Rubinchik
ML Schmid
P Gawrychowski
Z Galil
Z Xu
Publication date: 01/01/2017

Given a pattern $p = s_1x_1s_2x_2\cdots s_{r-1}x_{r-1}s_r$ such that $x_1,x_2,\ldots,x_{r-1}\in\{x,\overleftarrow{x}\}$, where $x$ is a variable and $\overleftarrow{x}$ its reversal, and $s_1,s_2,\ldots,s_r$ are strings that contain no variables, we describe an algorithm that constructs in $O(rn)$ time a compact representation of all $P$ instances of $p$ in an input string of length $n$ over a polynomially bounded integer alphabet, so that one can report those instances in $O(P)$ time. Comment: 16 pages (+13 pages of Appendix), 4 figures, accepted to SPIRE 201

arXiv.org e-Print Archive

Crossref

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Palindromic Length of Words with Many Periodic Palindromes

Author: A Frid
A Saarela
AE Frid
D Kosolobov
G Fici
M Bucci
M Rubinchik
P Ambrož
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 04/05/2020

The palindromic length $\text{PL}(v)$ of a finite word $v$ is the minimal number of palindromes whose concatenation is equal to $v$. In 2013, Frid, Puzynina, and Zamboni conjectured that if $w$ is an infinite word and $k$ is an integer such that $\text{PL}(u)\leq k$ for every factor $u$ of $w$, then $w$ is ultimately periodic. Suppose that $w$ is an infinite word and $k$ is an integer such that $\text{PL}(u)\leq k$ for every factor $u$ of $w$. Let $\Omega(w,k)$ be the set of all factors $u$ of $w$ that have more than $\sqrt[k]{k^{-1}\vert u\vert}$ palindromic prefixes. We show that $\Omega(w,k)$ is an infinite set, and that for each positive integer $j$ there are palindromes $a,b$ and a word $u\in \Omega(w,k)$ such that $(ab)^j$ is a factor of $u$ and $b$ is nonempty. Note that $(ab)^j$ is a periodic word and $(ab)^ia$ is a palindrome for each $i\leq j$. These results justify the following question: What is the palindromic length of a concatenation of a suffix of $b$ and a periodic word $(ab)^j$ with "many" periodic palindromes? It is known that $\lvert\text{PL}(uv)-\text{PL}(u)\rvert\leq \text{PL}(v)$, where $u$ and $v$ are nonempty words. The main result of our article shows that if $a,b$ are palindromes, $b$ is nonempty, $u$ is a nonempty suffix of $b$, $\vert ab\vert$ is the minimal period of $aba$, and $j$ is a positive integer with $j\geq3\text{PL}(u)$, then $\text{PL}(u(ab)^j)-\text{PL}(u)\geq 0$.
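
The definition of palindromic length used in this abstract can be checked on small words with a simple dynamic program. The sketch below is a quadratic-time illustration only (the function name `palindromic_length` is hypothetical and the code is not taken from the paper): it tabulates which factors are palindromes and then takes, for each prefix, the cheapest split that ends in a palindrome.

```python
def palindromic_length(v):
    """Minimal number of palindromes whose concatenation equals v."""
    n = len(v)
    # is_pal[i][j] is True when the factor v[i..j] (inclusive) is a palindrome.
    is_pal = [[False] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(i, n):
            if v[i] == v[j] and (j - i < 2 or is_pal[i + 1][j - 1]):
                is_pal[i][j] = True
    # pl[e] is the palindromic length of the prefix v[:e]; the empty word has 0.
    pl = [0] * (n + 1)
    for end in range(1, n + 1):
        pl[end] = min(pl[start] + 1
                      for start in range(end)
                      if is_pal[start][end - 1])
    return pl[n]


# One instance of the main result's setting: a = "a", b = "b", u = "b", j = 3.
u, a, b, j = "b", "a", "b", 3
print(palindromic_length(u + (a + b) * j) - palindromic_length(u))
```

The last lines only evaluate the difference for a single choice of $a$, $b$, $u$, and $j$; they illustrate the statement and prove nothing about it.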

arXiv.org e-Print Archive

Crossref

Lempel–Ziv-Like Parsing in Small Space

Author: Kosolobov D.
Navarro G.
Puglisi S. J.
Valenzuela D.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020

Lempel–Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely-used compressors for repetitive texts. However, the existing efficient methods computing the exact LZ parsing have to use linear or close to linear space to index the input text during the construction of the parsing, which is prohibitive for long inputs. An alternative is Relative Lempel–Ziv (RLZ), which indexes only a fixed reference sequence, whose size can be controlled. Deriving the reference sequence by sampling the text yields reasonable compression ratios for RLZ, but performance is not always competitive with that of LZ and depends heavily on the similarity of the reference to the text. In this paper we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate the LZ parsing using little memory. RLZ is first used to produce a sequence of phrases, and these are regarded as metasymbols that are input to LZ for a second-level parsing on a (most often) drastically shorter sequence. This parsing is finally translated into one on the original sequence. We analyze the new scheme and prove that, like LZ, it achieves the kth order empirical entropy compression nH_k + o(n log σ) with k = o(log_σ n), where n is the input length and σ is the alphabet size. In fact, we prove this entropy bound not only for ReLZ but for a wide class of LZ-like encodings. Then, we establish a lower bound on ReLZ approximation ratio showing that the number of phrases in it can be Ω(log n) times larger than the number of phrases in LZ. Our experiments show that ReLZ is faster than existing alternatives to compute the (exact or approximate) LZ parsing, at the reasonable price of an approximation factor below 2.0 in all tested scenarios, and sometimes below 1.05, to the size of LZ. © 2020, Springer Science+Business Media, LLC, part of Springer Nature. D. Kosolobov supported by the Russian Science Foundation (RSF), Project 18-71-00002 (for the upper bound analysis and a part of lower bound analysis). D. Valenzuela supported by the Academy of Finland (Grant 309048). G. Navarro funded by Basal Funds FB0001 and Fondecyt Grant 1-200038, Chile. S.J. Puglisi supported by the Academy of Finland (Grant 319454). This work started during Shonan Meeting 126 “Computation over Compressed Structured Data”. Funded in part by EU’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie Grant Agreement No. 690941 (project BIRDS)
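
The two-level idea described in this abstract (RLZ phrases treated as metasymbols and then parsed again by LZ) can be sketched as follows. This is a naive illustration under stated assumptions, not the ReLZ implementation: the reference is passed in explicitly, matching is done by brute force, the final translation back to a parsing of the original text is omitted, and the names `rlz_parse` and `lz_phrase_count` are hypothetical.

```python
def rlz_parse(text, reference):
    """Greedy RLZ: cut text into longest matches against a fixed reference.

    A phrase is a pair (position in reference, length); a letter that does
    not occur in the reference is emitted as a literal.
    """
    phrases = []
    i = 0
    while i < len(text):
        best_len, best_pos = 0, -1
        for j in range(len(reference)):
            length = 0
            while (i + length < len(text) and j + length < len(reference)
                   and text[i + length] == reference[j + length]):
                length += 1
            if length > best_len:       # keep the leftmost longest match
                best_len, best_pos = length, j
        if best_len == 0:
            phrases.append(text[i])     # literal letter
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


def lz_phrase_count(seq):
    """Number of phrases in a greedy LZ parsing of seq (naive version)."""
    i, count = 0, 0
    while i < len(seq):
        longest = 0
        for j in range(i):
            length = 0
            while i + length < len(seq) and seq[j + length] == seq[i + length]:
                length += 1
            longest = max(longest, length)
        i += max(longest, 1)
        count += 1
    return count


text = "abcabcabcabc"
metasymbols = rlz_parse(text, reference="abc")  # first level: RLZ phrases
print(len(metasymbols), lz_phrase_count(metasymbols))
```

The second-level parsing runs on the list of metasymbols, which is typically much shorter than the text; that is the source of the memory savings the abstract refers to.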

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Repositorio Académico de la Universidad de Chile

Palindromic Decompositions with Gaps and Errors

Author: A Apostolico
A Frid
D Breslauer
D Gusfield
D Kosolobov
DE Knuth
G Fici
G Manacher
M Crochemore
M Rubinchik
R Kolpakov
S Gupta
T I
X Droubay
Y Fujishige
Z Galil
Publication date: 27/03/2017

Identifying palindromes in sequences has been an interesting line of research in combinatorics on words and also in computational biology, after the discovery of the relation of palindromes in the DNA sequence with the HIV virus. Efficient algorithms for the factorization of sequences into palindromes and maximal palindromes have been devised in recent years. We extend these studies by allowing gaps in decompositions and errors in palindromes, and also imposing a lower bound to the length of acceptable palindromes. We first present an algorithm for obtaining a palindromic decomposition of a string of length n with the minimal total gap length in time O(n log n · g) and space O(n g), where g is the number of allowed gaps in the decomposition. We then consider a decomposition of the string in maximal δ-palindromes (i.e. palindromes with δ errors under the edit or Hamming distance) and g allowed gaps. We present an algorithm to obtain such a decomposition with the minimal total gap length in time O(n (g + δ)) and space O(n g). Comment: accepted to CSR 201

arXiv.org e-Print Archive

Crossref

Pal^k is linear recognizable online

Author: Kosolobov D.
Rubinchik M.
Shur A. M.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

Given a language L that is online recognizable in linear time and space, we construct a linear time and space online recognition algorithm for the language L·Pal, where Pal is the language of all nonempty palindromes. Hence for every fixed positive k, Pal^k is online recognizable in linear time and space. Thus we solve an open problem posed by Galil and Seiferas in 1978. © Springer-Verlag Berlin Heidelberg 2015
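
Membership in Pal^k, as defined in this abstract, means that a word is a concatenation of exactly k nonempty palindromes. The sketch below tests this by brute force over cut positions; it is an offline check with polynomial running time, given only to illustrate the language, and is not the linear-time online algorithm of the paper. The function name `in_pal_k` is hypothetical.

```python
def in_pal_k(s, k):
    """Return True if s is a concatenation of exactly k nonempty palindromes."""
    n = len(s)

    def is_pal(i, j):
        piece = s[i:j]
        return piece == piece[::-1]

    # reachable holds every position p such that s[:p] belongs to Pal^t
    # after t rounds; Pal^0 contains only the empty word.
    reachable = {0}
    for _ in range(k):
        reachable = {j
                     for i in reachable
                     for j in range(i + 1, n + 1)
                     if is_pal(i, j)}
    return n in reachable


print(in_pal_k("abba", 1), in_pal_k("abba", 2), in_pal_k("abba", 3))
```

Each round appends one more palindrome, which mirrors the step from a language L to L·Pal in the abstract.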

CiteSeerX

Crossref

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Near-Optimal Computation of Runs over General Alphabet via Non-Crossing LCE Queries

Author: C Hohlweg
CSJA Nash-Williams
D Kosolobov
GS Brodal
H Barcelo
J Fischer
M Crochemore
M Giraud
SJ Puglisi
W Rytter
Publication date: 01/01/2016

Longest common extension queries (LCE queries) and runs are ubiquitous in algorithmic stringology. Linear-time algorithms computing runs and preprocessing for constant-time LCE queries have been known for over a decade. However, these algorithms assume a linearly-sortable integer alphabet. A recent breakthrough paper by Bannai et al. (SODA 2015) showed a link between the two notions: all the runs in a string can be computed via a linear number of LCE queries. The first to consider these problems over a general ordered alphabet was Kosolobov (Inf. Process. Lett., 2016), who presented an $O(n (\log n)^{2/3})$-time algorithm for answering $O(n)$ LCE queries. This result was improved by Gawrychowski et al. (accepted to CPM 2016) to $O(n \log \log n)$ time. In this work we note a special non-crossing property of LCE queries asked in the runs computation. We show that any $n$ such non-crossing queries can be answered on-line in $O(n \alpha(n))$ time, which yields an $O(n \alpha(n))$-time algorithm for computing runs.
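
For readers unfamiliar with runs, the sketch below lists them by brute force. It assumes the standard definition (a run is a factor with minimal period p and length at least 2p that cannot be extended in either direction with the same period) and takes quadratic time; it does not use LCE queries and is unrelated to the algorithm of the paper. The function name `runs` is hypothetical.

```python
def runs(s):
    """Return all runs of s as sorted triples (start, end, period), 0-indexed, inclusive."""
    n = len(s)
    found = {}
    for p in range(1, n // 2 + 1):          # candidate periods, smallest first
        i = 0
        while i + p < n:
            if s[i] != s[i + p]:
                i += 1
                continue
            j = i
            while j + p < n and s[j] == s[j + p]:
                j += 1
            # s[i .. j+p-1] has period p and is maximal for this period.
            start, end = i, j + p - 1
            # Keep it only if it is long enough; a span already recorded
            # has a smaller period, so p is not its minimal period.
            if end - start + 1 >= 2 * p and (start, end) not in found:
                found[(start, end)] = p
            i = j + 1
    return sorted((st, en, p) for (st, en), p in found.items())


for start, end, period in runs("mississippi"):
    print(start, end, period)
```

Processing the periods in increasing order is what makes the duplicate check sufficient: a span that is maximal for a non-minimal period is also maximal for its minimal period, so it has already been stored.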

arXiv.org e-Print Archive

Crossref

King's Research Portal

Hal-Diderot

HAL - UPEC / UPEM

Run compressed rank/select for large alphabets

Author: Fuentes-Sepulveda J.
Karkkainen J.
Kosolobov D.
Puglisi S.
Publication venue: IEEE
Publication date: 01/01/2018

Given a string of length n that is composed of r runs of letters from the alphabet {0, 1, ..., σ-1} such that 2 ≤ σ ≤ r, we describe a data structure that, provided r ≤ n/log^{ω(1)} n, stores the string in r log(nσ/r) + o(r log(nσ/r)) bits and supports select and access queries in O(log(log(n/r)/log log n)) time and rank queries in O(log(log(nσ/r)/log log n)) time. We show that r log(n(σ-1)/r) - O(log(n/r)) bits are necessary for any such data structure and, thus, our solution is succinct. We also describe a data structure that uses (1 + ϵ) r log(nσ/r) + O(r) bits, where ϵ > 0 is an arbitrary constant, with the same query times but without the restriction r ≤ n/log^{ω(1)} n. By simple reductions to the colored predecessor problem, we show that the query times are optimal in the important case r ≥ 2^{log^δ n}, for an arbitrary constant δ > 0. We implement our solution and compare it with the state of the art, showing that the closest competitors consume 31-46% more space. © 2018 IEEE. Peer reviewed
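
The access, rank, and select queries named in this abstract can be illustrated on a run-length encoded string with plain arrays and binary search. The sketch below is not succinct and answers queries in O(log r) time, so it shows only the query semantics, not the data structure of the paper; the class name `RunLengthString` and the 0-based indexing convention are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


class RunLengthString:
    """Run-length encoded string with access, rank and select queries."""

    def __init__(self, s):
        self.starts, self.letters = [], []      # one entry per run
        for i, c in enumerate(s):
            if i == 0 or c != s[i - 1]:
                self.starts.append(i)
                self.letters.append(c)
        # Per letter: run starts, run lengths, and occurrences before each run.
        self.by_letter = {}
        for k, (st, c) in enumerate(zip(self.starts, self.letters)):
            end = self.starts[k + 1] if k + 1 < len(self.starts) else len(s)
            run_starts, lengths, before = self.by_letter.setdefault(c, ([], [], []))
            before.append(before[-1] + lengths[-1] if before else 0)
            run_starts.append(st)
            lengths.append(end - st)

    def access(self, i):
        """Letter at position i (0-based)."""
        return self.letters[bisect_right(self.starts, i) - 1]

    def rank(self, c, i):
        """Number of occurrences of c in the prefix s[0:i]."""
        if c not in self.by_letter:
            return 0
        run_starts, lengths, before = self.by_letter[c]
        k = bisect_left(run_starts, i) - 1      # last run of c starting before i
        if k < 0:
            return 0
        return before[k] + min(i - run_starts[k], lengths[k])

    def select(self, c, j):
        """Position of the j-th occurrence of c (j counted from 1)."""
        run_starts, lengths, before = self.by_letter[c]
        if not 1 <= j <= before[-1] + lengths[-1]:
            raise IndexError("no such occurrence")
        k = bisect_right(before, j - 1) - 1     # run containing that occurrence
        return run_starts[k] + (j - 1 - before[k])


rs = RunLengthString("aaabbaacc")
print(rs.access(4), rs.rank("a", 6), rs.select("a", 4))
```

Only the r runs are stored, so the space depends on the number of runs rather than on the string length, which is the property the abstract's bounds refine.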

arXiv.org e-Print Archive

Crossref

Helsingin yliopiston digitaalinen arkisto

Cold intense electron beams from LN2-cooled GaAs-photocathodes

Author: Kosolobov S.
Orlov D.
Schwalm D.
Terekhov A.
Weigel U.
Wolf A.
Publication date: 01/01/2005

To study electron-ion interactions at the Heidelberg heavy-ion storage ring, electron beams with low-energy spreads and dc-currents of milliamperes are desired. Measurements of the photoelectron energy distribution showed that electron beams with energy spreads of 5-8 meV can be obtained from GaAs photocathodes, cooled to about LN2-temperature. However, in order to get milliamperes beam currents, the laser illumination has to be increased up to 1 W, causing substantial cathode heating. The presented new electron gun design based on sapphire-substrate transmission-mode photocathodes, cooled by LN2, stabilizes the GaAs bulk temperature under 1 W laser illumination at about 95 K and thereby provides the prerequisites for an electron gun being operated at milliampere-currents with low-energy spreads

MPG.PuRe